Find the best number of clusters with k-means and agglomerative clustering
We will test KMeans with a varying number of clusters, from 2 to 10, and then compare it with AgglomerativeClustering and DBSCAN.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from mpl_toolkits import mplot3d
from sklearn.model_selection import ParameterGrid
import warnings
warnings.filterwarnings("ignore")
random_state = 42 # this value will be passed to every call that accepts a random_state parameter,
# so that the run can be reproduced exactly;
# change this value for a different experiment
Check the shape and plot the content
We observe that the distributions of values are markedly skewed: in the columns from Fresh to Delicassen most values are concentrated in a narrow range, but there are always outliers, frequently spread over a very large range.
Clustering is more effective in the absence of outliers and when all the variables span similar ranges; for this reason we will apply two transformations:
- transform the columns from Fresh to Delicassen with PowerTransformer, to reduce skewness
- scale all the columns to the range 0:1 with MinMaxScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
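The two transformations can be sketched as below. This is a minimal, self-contained example: the lognormal columns are hypothetical stand-ins for the skewed Fresh..Delicassen columns of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, MinMaxScaler

# Hypothetical skewed data standing in for the Fresh..Delicassen columns
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Fresh": rng.lognormal(mean=9, sigma=1.0, size=200),
    "Delicassen": rng.lognormal(mean=7, sigma=1.2, size=200),
})

# 1) PowerTransformer (Yeo-Johnson by default) reduces skewness,
#    shrinking the influence of the outliers
X = PowerTransformer().fit_transform(df)

# 2) MinMaxScaler maps every column to the [0, 1] range
X = MinMaxScaler().fit_transform(X)
```

After both steps every column spans exactly [0, 1], so no variable dominates the distance computations used by the clustering algorithms.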
Show the result of the transformation
Now that the effect of the outliers is reduced, we can compute the clustering.
Test KMeans with varying number of clusters, from 2 to 10
Prepare a results list that will contain a pair (inertia, silhouette_score) for each value of k. Then, for each value of k:
- create a KMeans model and call fit and predict
- store the inertia_ attribute of the fitted model
- compute silhouette_score (from sklearn.metrics) using the data and the fitted labels as arguments, filling the variable silhouette_scores
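The loop above can be sketched as follows; `make_blobs` data is used here as a stand-in for the transformed customer table so the snippet is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the transformed dataset
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

random_state = 42
inertias, silhouette_scores = [], []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=random_state, n_init=10)
    labels = km.fit_predict(X)
    inertias.append(km.inertia_)                       # for the elbow criterion
    silhouette_scores.append(silhouette_score(X, labels))

# Candidate k with the highest silhouette score
best_k = list(range(2, 11))[int(np.argmax(silhouette_scores))]
```

Plotting `inertias` and `silhouette_scores` against `range(2, 11)` produces the elbow and silhouette curves discussed below.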
The two elbow points of the inertia curve suggest 3 or 4 clusters, with the elbow slightly more pronounced at 3. The silhouette has a maximum at 4, but the increase with respect to 3 is very small.
We will choose k=4
Show the distribution of samples in the clusters with a pie chart
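A minimal sketch of the pie chart, again on the synthetic stand-in data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

sizes = np.bincount(labels)  # number of samples in each cluster
plt.pie(sizes, labels=[f"cluster {i}" for i in range(len(sizes))],
        autopct="%1.1f%%")
plt.title("Distribution of samples in the 4 clusters")
```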
The silhouette score ranges from -1 (worst) to 1 (best); as a rule of thumb, a value greater than 0.5 should be considered acceptable.
We will try a grid of parameter configurations, with the number of clusters in the range 2:10 and the four linkage methods available in the sklearn implementation of AgglomerativeClustering.
from sklearn.cluster import AgglomerativeClustering
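The grid of configurations can be explored as sketched below, collecting one silhouette score per (n_clusters, linkage) pair; the synthetic `make_blobs` data stands in for the transformed dataset.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# 9 cluster counts x 4 linkage methods = 36 configurations
param_grid = ParameterGrid({
    "n_clusters": range(2, 11),
    "linkage": ["ward", "complete", "average", "single"],
})
rows = []
for params in param_grid:
    labels = AgglomerativeClustering(**params).fit_predict(X)
    rows.append({**params, "silhouette": silhouette_score(X, labels)})

# Sort so the best configurations appear first
results = pd.DataFrame(rows).sort_values("silhouette", ascending=False)
```

Inspecting `results.head()` gives the top configurations discussed in the next paragraph.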
The top five results have very similar silhouette scores; we will choose the setting with 4 clusters, as for k-means, and the linkage giving the best result with 4 clusters, that is ward. This is the result record with index 2 (the record index is the unnamed column at the very left of the dataframe output).
Show the distribution of data in the clusters
The solution with AgglomerativeClustering in this case provides a result very similar to that of k-means.
It is interesting to compare more deeply the results of the two clustering models.
The function pair_confusion_matrix computes the number of pairs of objects that are in the same clusters or in different clusters in two different clustering schemes.
The result is a 2x2 matrix: the smaller the numbers outside the main diagonal, the better the match.
For easier readability, we divide each element of the matrix by the sum of all the elements, so that the matrix entries are normalized to sum 1.
from sklearn.metrics import pair_confusion_matrix
A compact indicator of the match percentage can be obtained as the sum of the elements of the main diagonal.
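A minimal sketch of the comparison, on two toy labelings that define the same partition under different label names:

```python
import numpy as np
from sklearn.metrics import pair_confusion_matrix

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

m = pair_confusion_matrix(labels_a, labels_b)  # counts of ordered pairs, 2x2
m_norm = m / m.sum()            # normalize so the entries sum to 1
match = np.trace(m_norm)        # share of agreeing pairs; 1.0 = identical partitions
```

Here `match` is 1.0 because the two labelings group the samples identically, even though the label names differ.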
from sklearn.cluster import DBSCAN
Show the distribution of data in the clusters
ParameterGrid
Prepare the dictionary of the parameter ranges and produce the list of parameter settings to test with the function ParameterGrid.
Arrange DBSCAN results in a dataframe, for easier presentation and filtering
dbscan_out = pd.DataFrame(columns = ['eps','min_samples','n_clusters','silhouette', 'unclust%'])
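The DBSCAN grid search can be sketched as below. The eps and min_samples ranges are hypothetical examples, and the well-separated `make_blobs` data stands in for the transformed dataset; noise points receive the label -1 and must be excluded before computing the silhouette.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
                  cluster_std=0.6, random_state=42)

dbscan_out = pd.DataFrame(columns=['eps', 'min_samples', 'n_clusters', 'silhouette', 'unclust%'])
for params in ParameterGrid({"eps": [0.3, 0.5, 1.0], "min_samples": [3, 5, 10]}):
    labels = DBSCAN(**params).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    mask = labels != -1  # noise points get the label -1
    sil = silhouette_score(X[mask], labels[mask]) if n_clusters >= 2 else np.nan
    dbscan_out.loc[len(dbscan_out)] = [params["eps"], params["min_samples"],
                                       n_clusters, sil, 100 * (~mask).mean()]
```

Each row records a parameter setting together with the number of clusters found, the silhouette on the clustered points, and the percentage of unclustered (noise) samples.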
Show the cluster sizes
Use the function plot_silhouette contained in the module plot_silhouette_w_mean (provided with the notebook), and import silhouette_samples from sklearn.metrics, which computes the silhouette score of each individual sample.
from plot_silhouette_w_mean import plot_silhouette # python script provided separately
Hint: use help(plot_silhouette) for the meaning of the parameters
Hint: for DBSCAN you should exclude the rows of noise
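The noise-exclusion step can be sketched as below (the provided plot_silhouette module is not reproduced here; this only shows computing the per-sample scores on the non-noise rows, using the stand-in `make_blobs` data and hypothetical DBSCAN parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
                  cluster_std=0.6, random_state=42)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1                       # drop the noise rows before scoring
sample_sil = silhouette_samples(X[mask], labels[mask])
mean_sil = sample_sil.mean()              # overall mean over clustered samples
```

The `sample_sil` array (and the matching `labels[mask]`) is what a per-cluster silhouette plot such as plot_silhouette would consume.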
For each of the clustering schemes show how the attribute values are distributed in the clusters
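One way to inspect the per-cluster attribute distributions is a groupby summary (or a boxplot per attribute); a minimal sketch on the stand-in data, with hypothetical attribute names:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the transformed customer table
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
df = pd.DataFrame(X, columns=["attr_1", "attr_2"])
df["cluster"] = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

# Mean of each attribute per cluster; a boxplot works too, e.g.:
# df.boxplot(column="attr_1", by="cluster")
summary = df.groupby("cluster").mean()
```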